TP, DP, and PP

#tp #pp
TP (Tensor Parallelism), DP (Data Parallelism), and PP (Pipeline Parallelism) are the main strategies Megatron uses to distribute the training of large-scale deep learning models, particularly in natural language processing (NLP), across many devices. Let's look at each of these concepts:

1. Tensor Parallelism (TP): splits individual weight matrices (e.g., the attention and MLP layers of a transformer) across devices, so each device stores and computes only a slice of every layer (see the sketch after this list).

2. Data Parallelism (DP): replicates the model (or each TP/PP group) across devices, feeds each replica a different shard of the training batch, and averages gradients across replicas.

3. Pipeline Parallelism (PP): partitions the model by layers into stages placed on different devices, with micro-batches flowing through the stages like an assembly line.
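As a rough illustration of the TP idea, here is a minimal single-process sketch of a column-parallel linear layer (the slicing is simulated in-process; this is not Megatron's actual implementation, which uses torch.distributed process groups and collective communication):

```python
# Minimal single-process sketch of a Megatron-style column-parallel linear layer.
# It only illustrates the math; real TP shards the weight across GPUs and uses
# collectives (all-gather / all-reduce) instead of in-process slicing.
import torch

tp_size = 2                       # pretend we have 2 tensor-parallel ranks
x = torch.randn(4, 8)             # input activations: [batch, hidden]
full_weight = torch.randn(8, 16)  # full weight of the layer: [hidden, out]

# Each "rank" holds only a column slice of the weight.
shards = torch.chunk(full_weight, tp_size, dim=1)

# Each rank computes its partial output independently; the column-parallel
# matmul itself needs no communication.
partial_outputs = [x @ w_shard for w_shard in shards]

# Gathering the partial outputs along the column dimension reproduces the
# result of the unsharded layer.
y_tp = torch.cat(partial_outputs, dim=1)
y_ref = x @ full_weight
assert torch.allclose(y_tp, y_ref, atol=1e-6)
```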

Relationship between TP, DP, and PP in Megatron

In summary, TP, DP, and PP in Megatron distribute model parameters, training data, and computation across multiple devices, enabling efficient and scalable training of large NLP models. Each strategy plays a distinct role in balancing memory usage, communication cost, and throughput during distributed training.

dp = world_size / tp / pp
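A quick sketch of how the data-parallel size falls out of the world size once tp and pp are fixed (variable names and the rank layout below are illustrative, not Megatron's exact group ordering):

```python
# world_size = tp * pp * dp, so dp is whatever is left over
# after the tensor- and pipeline-parallel sizes are chosen.
world_size = 32   # total number of GPUs
tp = 4            # tensor-parallel size (GPUs per TP group)
pp = 2            # pipeline-parallel size (number of pipeline stages)

assert world_size % (tp * pp) == 0, "tp * pp must divide world_size"
dp = world_size // (tp * pp)
print(f"dp = {dp}")   # 32 / 4 / 2 = 4 data-parallel replicas

# One possible layout of ranks on a (pp, dp, tp) grid: TP ranks are adjacent
# (they communicate most), PP ranks are furthest apart. The actual ordering
# in Megatron may differ.
for rank in range(world_size):
    tp_rank = rank % tp
    dp_rank = (rank // tp) % dp
    pp_rank = rank // (tp * dp)
    # each (pp_rank, dp_rank, tp_rank) triple identifies one GPU's role
```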

How PP (pipeline parallelism) works
Pasted image 20240626012411.png
Pasted image 20240626013930.png
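The images above show the pipeline schedule as a timeline. As a toy textual sketch of the same idea (a forward-only, GPipe-style schedule; Megatron actually uses a 1F1B interleaved schedule), the batch is split into micro-batches so the stages overlap after an initial "bubble":

```python
# Toy sketch of a GPipe-style pipeline schedule (forward passes only):
# with S stages and M micro-batches, stage s works on micro-batch m at
# clock step s + m, so stages overlap once the pipeline is filled.
num_stages = 4        # pp = 4
num_microbatches = 8

for step in range(num_stages + num_microbatches - 1):
    busy = []
    for stage in range(num_stages):
        mb = step - stage
        if 0 <= mb < num_microbatches:
            busy.append(f"stage{stage}:mb{mb}")
        else:
            busy.append(f"stage{stage}:idle")   # pipeline bubble
    print(" | ".join(busy))
```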

https://www.bilibili.com/video/BV1WD4y1t7Ba/?spm_id_from=333.337.search-card.all.click&vd_source=9746697102ead983ecbe06ba12115f1e